Day 3: Analysing single-cell data

Rasa Elmentaite & Gerda Kildisiute, 2021

Goals

By the end of this tutorial you should be able to:

Input dataset

Data produced in a single cell RNA-seq experiments gained popularity over its predicesor bulk RNA-seq due to its ability to measure heterogeniety of cells in the sample. Two characteristics that are important to keep in mind when working with scRNA-seq are drop-out (the excessive amount of zeros due to limited mRNA) and the potential of biology to be confounded by technical factors (e.g. batch, donor ect.).

For this tutorial we will use scRNA-seq data from patients with severe Coronavirus disease 2019 (COVID-19), the global pandemic caused by SARS-CoV-2 virus. The data includes a cell atlas of the bronchoalveolar lavage fluid immune cells from healthy and COVID19 patients published in Nature medicine (https://www.nature.com/articles/s41591-020-0901-9).

To reduce running time, we will use a subset of the atlas that includes a total of 60,382 cells from 2 healthy (H1- donor C51, H2- donor C52) and 2 COVID19 patients (S1- donor C143, S2 - donor C145).

Introduction

In the previous session, you have learned how to generate matrix files from FASTQ files. In this session, we will begin by loading data, inspecting the count matrices and creating data object with counts, genes and metadata.

This session will also guide you to how to check the quality of your cells. This includes: checking the quality of the generated count data, preprecessing the data for analysis and visualising them by clustering. This is most basic workflow that you will run in any single-cell data analysis.

There are many packages, which have collection of functions for running this workflow. Such as Seurat, Scanpy, Monocle3. They offer streamlined workflow and tutorials that can be easily followed.

Today, we will use Seurat R package (http://satijalab.org/seurat/), a popular and powerful set of tools to conduct scRNA-seq analysis in R.

Pre-processing and dimensionality reduction

1. Different Single cell data objects

1.1 Load the relevant packages.

You can check package versions using sessionInfo() for future reproducibility of your results.

1.2 Read counts matrix

Typically the alignment tool (e.g. cellranger) output files include matrix.mtx, barcodes.tsv, genes.tsv files. These can be uploaded using Read10X() function in Seurat. Let's first load and examine the healthy donor run in our hc1 directory.

Note: Change the directory name to mydir/ where you store your data! Currently this is done for you, but if you are doing this practical outside the VM (e.g. you downloaded the data to your own machine and are running it there) don't forget to change your paths.

1.3 Examine counts matrix

Here we see the counts matrix has 33538 genes (rows) and 11115 cells (columns).

Storing the matrix in a dense format (where all count values including zeros are stored) takes almost 10 times as much memory than sparse matrix. This memory saving is very important, especially as data sets are now being created that are beyond a million cells. These matrices can become unmanageable without special computing resources.

In the sparse representation, we assume that the majority of count values in a matrix are zero. We only store the non-zero values. This is implemented in the Matrix package using a dgTMatrix object.

We can check the size of the sparve vs dense matrix using object.size() function.

1.4 Create & examine Seurat object

To analyze our single cell data we will use a seurat object. Can you create a Seurat object with the 10x data and save it in an object called ‘seurat’? hint: CreateSeuratObject().

In Seurat object, genes = Features and reads = Counts. If you want to get some basic info about the seurat object, just type the name of the object into the console.

Task: Can you include only genes that are are expressed in 3 or more cells and cells with complexity of 350 genes or more? How many genes are you left with? How many cells?

The Seurat object contains various “slots” (designated by seurat@slotname) that will store the raw count data as well as the results from downstream computations. This has the advantage that we do not need to keep track of inidividual variables of interest - they can all be collapsed into a single object as long as these slots are pre-defined.

seurat@assays$RNA@counts is a slot that stores the original gene count matrix. You can expect other slots in the object using str(seurat) function.

1.5 Combine multiple data runs

You will most likely have multiple files from different experiments or sequencing runs that will need to be concatenated. Here we will load the remaining healthy and covid19 donors and create a single seurat object with all data. You could create a single vector for all paths and read the data in together if all your matrices had the same dimensions.

Note: some of these steps take a few seconds to run because of the size of the data, so just be patient.

Let's also assign the healthy and disease status to cells in the metadata.

You can use merge() function to merge Seurat objects. If you only have two objects, you can simply type them into the function (separated by a coma). If you have more than two objects, you need to make a list, e.g. merge(x=a, y=list(b,c,d,..))

Note: you will find which patient each cell came from in combined_seurat@meta.data$orig.ident.

Task: Inspect your Seurat object and report how many cells you have in each donor.

It is useful to know how to subsample your object, especially if you are working with huge cell atlases. We will subsample 10k cells from our object.

Note: set.seed() is a function used to set the seed of R's random number generator - it ensures that you select the same numbers every time you run the code. Use set.seed() function to make sure that you select the same cells each time you run your code. Please don't change it when you're going through the practical, as you're going to get different results and plots than what's in the rest of the practical. If you want a challenge, you can re-run everything with a different seed afterwards and see how things change!

OK, let's see what the distribution of cells looks after subsetting.

Task: Depending on your own data, you might want to subset evenly by patient, or cell type, or some other variable - we just sampled 10k cells from the entire dataset. can you try writing a script that would randomly sample 1k cells from each patient, and subset the original object? You will have to re-run combined_seurat = merge(healthy_seurat, disease_seurat).

Note don't forget to re-run commands below when you're ready to carry on with the rest of the practical!

combined_seurat = merge(healthy_seurat, disease_seurat) set.seed(42) samp1 = sample(colnames(combined_seurat), 10000, replace = F) combined_seurat = subset(combined_seurat, cells = samp1)